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Abstract. - In the framework of the Poland Scheraga model of DNA denaturation, we derive 
a recursion relation for the partition function of double stranded DNA, allowing for mismatches 
between the two strands. This relation is studied numerically using standard parameters for 
the stacking energies and loop entropies. For complementary strands (of length 1000), we find 
that mismatches are significant only when the cooperativity parameter a is of order one. Since 
a ~ O(10~ 5 ) in DNA, entropic gains from mismatches are overwhelmed by the energetic cost of 
opening a loop: mismatches are therefore irrelevant (and molecular recognition of the strands 
is perfect). Generating random mutations with probability p on one strand, we find that large 
values of a are rather tolerant to mutations. For realistic (small) values of cr, the two strands 
do not recombine, even for small mutation fractions. Thus, molecular recognition is extremely 
selective. 



Introduction. - Natural DNA exists as a double-helix strand. Upon heating, the two 
strands may separate. This unbinding transition is called DNA denaturation. The reverse 
process of binding is called renaturation, recombination, or, in a more biological wording, 
recognition. 

The Poland-Scheraga model of DNA denaturation ^2121 ^ as ^ een recently revisited from 
rather different points of view g]|o]|6]|7||8] . The physics approach [o]|d]|7||8] considers the phase 
transition of the homopolymeric model (a single pairing or stacking energy). The transition is 
a first order transition, if the excluded volume interaction between the two strands is properly 

si 

taken into account. In a nutshell, one models the weight of a loop of I bases as where 
s is the entropy per base pair in a loop, c is an exponent describing intra- and inter-strand 
excluded volume, and the cooperativity parameter a is generally taken as unity. For c > 2, 
a first order (denaturation) transition occurs. Previous models relied on a value of c ~ 1.8, 
corresponding to a second order transition. 

On the other hand, Yeramian 0] used the original Poland-Scheraga model (with c < 2) to 
study the denaturation transition of real DNA sequences (that is with 10 different stacking 
energies). Apart from minor differences (stacking vs. pairing energies, inclusion of the entropy 
s in the Boltzmann weight), this realistic approach relies on the use of a factor a ~ 10~ 5 — 10~ 6 , 
in marked contrast to the value a — 0(1) used by physicists. The physical origin of a is the 
overlap of the 7r orbitals of the (deoxy)ribose rings of stacked bases. A small a value means 
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that opening a loop is very costly, and leads to a very sharp (cooperative) first order like 
denaturation transition. This sharpness allows Yeramian to identify coding regions in real 
DNA sequences as having a higher melting temperature than the non-coding ones. The 
interplay between the values of c > 2 and a was recently studied jHj for some DNA sequences, 
in particular with respect to the Meltsim program ^H], where c = 1.75 and a ~ 1CP 5 . 

An important puzzle for physicists is how the extreme selectivity required by the biological 
machinery can be achieved in spite of the very high entropy of non selective binding. For 
instance, in a DNA micro-array, single strands of DNA are grafted on a surface. When 
this array is immersed in a solution containing complementary and mutated strands, the 
recognition process occurs with a high accuracy, with a very low rate of errors. 

In this letter, we address these issues, allowing for mismatches in the renaturation process. 
Our line of thought is as follows: if loops are not costly energy-wise, one may a priory consider 
that mismatches between the two strands (i.e. base a of the first strand being paired or stacked 
with base (3 of the second) play an important role. In the homopolymeric case, it is easy to 
see that the sole effect of mismatches is to replace the exponent c by c — 1 . 

What we want to do in this letter is to adopt a pragmatic point of view, leaving theoretical 
ideas for a longer paper We will mostly work at c = 1.8, using different values of the 

cooperativity parameter a. We have checked that our results do not change significantly when 
using c = 2.15 as suggested in ref.( [H]). 

The plan of the paper is as follows: we first recall a few facts on the algorithmic aspects 
of the problems with- and without- mismatches, and how to speed up calculations. The 
comparison between the two cases will be first studied for complementary sequences. For a 
small, mismatches can be ignored, while they are essential for a ~ 0(1). Finally, we consider 
the case of non complementary sequences, and argue that small values of a are not very 
tolerant to mutations, in marked contrast to the case a ~ 0(1). 

Summary of Poland-Scheraga theory. - We first consider two complementary strands of 
length N, and we denote by Z(a) the partition function of the two strands originating at 
base 1 and paired at base a. Using Z(l) = l,Z(2) = e~P ei > a,1 > a as boundary conditions, the 
original Poland-Scheraga (PS) calculation can be recast in the recursion relation 



where e a _i a -,a—i,a 1S the stacking energy between bases a — 1 and a, a is the cooperativity 
parameter (that is the loop fugacity), and Af(2(a — a')) is the number of closed loops of length 
2 (a — a'). Equation expresses the fact that pairing of base pair a may result either from 
stacking base pairs a — 1 and a or closing a loop at a' . The scaling form of the number of 
closed loops of length I is given by [T2"] 



where s is the entropy per base in a loop, and c is a critical exponent, depending on the model 
used for the polymeric model. For a Brownian chain, one has c = 3/2, whereas c ~ 1.8 for 
a self-avoiding chain. For interacting self-avoiding loops, it has been argued that one should 
use c~ 2.15 0. 

Note that eq.Q can be backtracked (at least at low temperature) in order to determine 
which fragments of the chain are bound, and which ones are unbound. For that purpose, one 
iterates the recursion till a = N, keeping track of all the Z(a). Going from N to N — 1, 
one keeps in Q the one term of the r.h.s. which is dominant. If it is the first term, it implies 



a-2 




(1) 



Af(l)=e sl /l c 



(2) 
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that the last pair is stacked, whereas if it is one of the other terms of the sum, say a' , there is 
a loop which ends at a! . By iterating this procedure, one ends up with the full conformational 
structure of the chain. 

We recall that the stacking energies have been shown to describe nucleotides interactions 
in a much more accurate fashion than simple base pairing. The idea is that in addition to the 
usual Crick- Watson pairing, there is a big component of the binding energy originating from 
the stacking of the ribose rings. In practice, this is modeled by assuming that the interactions 
depend on pairs of adjacent bases on the two strands, namely are of the form Ei^+i-jj+i 
instead of e^.j. Since there are four different bases, there are in principle 4 4 = 256 possible 
stacking energies, out of which only 10 turn out to be non zero. 

In the homogeneous case £ a — !.,<*■<*— l a = e i a Laplace transform with respect to a allows 
to solve exactly equation ft), yielding a continuous (if 1 < c < 2) or discontinuous (if c > 
2) transition. For (c < 1), the strands are bound at all temperatures. Note that in the 
homogeneous case, the transition temperature is given by 



c s -io g (i+,Er.ii/; c ) 

This equation shows the very weak dependence of the critical temperature on the loop expo- 
nent c for physical values of a (~ 10~ 4 — 10~ 6 ). 

From a more practical perspective, solving numerically equation (JJJ requires a CPU time 
of order N 2 . The Fixman-Freire method ^3E] reduces this CPU time to order N by ap- 
proximating the loop factor by 

i-E^e-M (4) 
i=i 

In equation J2J) the number / of couples (ai,6j) depends on the desired accuracy. For a 
sequence of length 1000, the value 1 = 9 gives an accuracy better than 0.5%. Larger values 
(I = 14) are used in 4 for lengths of 150000 base pairs or in the program Meltsim \U)\ . 

To be complete, a special treatment is required for the extremities. Indeed, the number of 
conformations of a segment of length I at the extremities of the chain is 

Af(l) ~ e sl P- x (5) 

where the exponent 7 ~ 1.15. Thus, one should slightly modify the above recursion at the 
origin to account for this fact. 

Mismatch. - We again consider two strands of length TV and N', and denote by a (resp. 
0) the base index of the first (resp. the second) strand. Let Z(a, 0) be the partition function 
with bases a and paired (in these notations, the PS model corresponds to a = 0). Using the 
same kind of boundary conditions as in the previous Section, Z(l, 1) = l,Z(a,l) = Z(l,a) = 
0, for any a = 2, ...,N(N') and Z(2,2) = e^ 1 * 1 . 2 , we may write 

Z(a, (3) =Z(a-l,(3- 1) e -P E «-i,««>-i,f> 

a-l /3-1 

+^^(1- V.a-i) V,/3-i))^(«'> -a' +0-13') (6) 

a' = l /3'=1 

The interpretation of equation © is the same as in equation (JJJ . In the homogeneous case 
(£ a —i,a;P—i,0 — £ )i one may again solve the problem: the results are the same as the (PS) 
model, except that c is replaced by c — 1, e.g. a discontinuous transition occurs for c — 1 > 2. 



(3) 
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This equation can be backtracked exactly as was explained in the previous section to obtain 
the conformations of the chains at low temperature. 

In the following, we shall be interested in strands of equal lengths N. From an algorithmic 
point of view, equation © shows that the calculation of the partition function, with our 
boundary conditions, is of order iV 4 (in the conclusion, we briefly discuss the influence of 
the boundary conditions on this result). In what follows, we have used a Fixman-Freire 
approximation to reduce this N power to a factor N 2 x I 2 . Similar reductions were previously 
obtained in studies of the denaturation of circular DNA . 

Defining 

a p 

Sj(a,0) = e - b ^ a+ ^ E ^ (a '+^Z(a',0') (7) 

a' = l /3'=1 

for j = 1, I, it can easily be seen that these functions satisfy the recurrence 
Sj(a,0) = e-^(5j(a,/3- 1) + S,-(a - 1,(3)) - e^S^a - 1,(3- 1) + W a p 
+ V a JSj(a - 1,0- 1) - e- b J(Sj(a - 1,0 - 2) + Sj(a - 2,0 - 1)) + e" 26 ' Sj(a -2,0- 2)\ 

(8) 

where W a p = a"Xw=i aie~ b 'S[(a -1,0-1) and V a p = e _/3e °- 1 °>' 3 - 1 .' 3 - a. The boundary 
conditions are 

S,-(l,l) = l, Sj(a > l) = S i (l,a)=e-^( a - 1 ) ) S {2, 2) = e~ 2b ' + e - 0s ^ 2 (9) 

for any j = 1, I and a = 2, N. 
The partition function is given by 

Z(a,0) = Sj(a,0) - e~ b i(S 3 (a -1,0) + S 3 {a,0- 1)) + e~ 2bi Sj{a -1,0-1) (10) 

It is clear from these equations that the algorithmic complexity to calculate the full parti- 
tion function is N 2 x I 2 instead of TV 4 . With a number of components / ~ 10, this allows to 
study sequences of size up to 1000 in a reasonable computer time. 

Our numerical calculations have been done on the first 1000 bases of a fugu gene p?- We 
have worked with 1 — 9 (oj, h) pairs. Since we are interested in finite sequences, we took 
the Flory value c — 1.8 for the loop exponent (for a small, the melting curves for real DNA 
sequences are very insensitive to the value of c, as can be inferred from eq.Jjyl). Finally the 
stacking energies are the ones used in the program Meltsim 10 . 

The order parameter in DNA denaturation is usually taken as the fraction of bound base 
pairs, denoted as 8. This quantity can be measured by UV absorption at 268 nm ^7]. The 
derivative —d0/dT with respect to temperature displays sharp peaks at the temperatures 
where various fragments of the sequence open. In the following, for practical reasons, we shall 
study the specific heat rather than the order parameter. Indeed, for a homopolymer, the 
fraction 9 is proportional to the internal energy of the chain, and thus —d9/dT is proportional 
to the specific heat. In the case where there is a non homogeneous sequence, it can be 
easily checked that the peaks of the specific heat coincide with those of the order parameter 
derivative. 

Complementary strands: the role of a. - Using equations and ©, we have compared 
the specific heat of the PS model with the specific heat of the model with mismatches. 
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Fig. 1 Fig. 2 

Fig. 1 - Specific heat for a = 10 -5 : the curves with and without mismatches are undistinguishable 
Fig. 2 - Specific heat with (continuous curve) and without (long dashed curve) mismatches for a = 1 



First, in the homogeneous case (single stacking parameter) the two models differ, since 
the PS model undergoes a continuous transition (with our choice c = 1.8), whereas the model 
with mismatches does not have a phase transition and remains bound at all temperatures 
(c — 1 — 0.8 < 1). We have solved the recursion for a homopolymer of size 1000. For a = 1. 
the specific heats of the two models are wildly different, whereas for a = 10 -5 there is a sharp 
peak in both specific heats. These peaks are located exactly at the same place, the difference 
between the two cases being that the height of the peak in the mismatch case is finite, whereas 
it scales with the size of the chain for the PS case. 

We now consider the reference sequence defined above, for different values of er. For 
a = 10 -5 (Fig. the curves can be superimposed, whereas the two models differ considerably 
for a = 1 (Fig. EJ). This means that for physical (small) values of er, mismatches between 
two complementary strands are negligible at any temperatures. The physical reason is that 
although mismatches are entropically favorable, they are suppressed by the high energetical 
cost of initiating a loop. This cost vanishes for a ~ 1. 

Random mutations: the role of a. - We have generated random mutations on one strand, 
starting from the complementary sequence. Our procedure is as follows: we mutate each base 
on this strand independently with probability p; if the mutation is accepted, the mutated base 
is different from the old one. Equation J^J is then implemented numerically. 

For small values of <r, the melting temperatures decrease with increasing values of p, as 
does the sharpness of the peaks as shown in (Fig. |31and[HJ for typical samples. Mutations 
that are far along the sequence are disfavored and the strands "unbind" more easily. For a 
mutation rate p of order 0.5% (corresponding to a typical number of mutated bases between 4 
and 6 in the reference sequence), recognition between the strands is already poor. For larger 
values of a (Fig. the presence of mismatches considerably lessens the role of mutations. 

Conclusion. - In this letter we have studied why mismatches can be neglected in DNA 
denaturation and recognition. We have argued that this is related mostly to the small value 
of the cooperativity parameter a. We have seen that it is this very small value of a which 
forces the very strong specificity in the DNA renaturation process. In other words, a defines 
a length scale Zj, ~ 1/a which characterizes the typical distance between loops. For a realistic 
value of a ~ 10~ 5 , this amounts to lengths of order 100000 base pairs. One expects significant 
effects of mismatches only beyond that length scale. 

We have checked that all our results are very insensitive to the loop exponent c, as long as 
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Fig. 3 



Fig. 4 



Fig. 3 - Effect of mutations: Specific heat for p — 0.005 (continuous curve) and p — (long dashed 
curve) for a = 10~ 5 . 



Fig. 4 - Effect of mutations: Specific heat for p — 0.005 (continuous curve) and p — (long dashed 
curve) for a = 1 



the loop fugacity a is small enough. Finally, the backtracking procedure can provide physical 
informations about the boundaries of helical regions. 

With our fixed boundary conditions (where the two extremities of each strands are fixed 
and paired) we have reduced the CPU time from ./V 4 to N 2 x I 2 through the use of a Fixman- 
Freire approximation. For general boundary conditions, we would get another TV 4 factor from 
relaxing the constraints on the extremities. This last factor can also be scaled down. Indeed, 
consider adding phantom paired links to the two extremities of each strands, namely Q ,0/3 
and, (N + l) a ,(N + The stacking energy of these phantom bases is taken to be 0. By 
this addition of phantom bases at the origin and at the extremity of the two strands, we allow 
effectively for open or closed boundary conditions on the two strands. The error in doing so is 
to replace the true P^ 1 by l~ c at the extremities of the chain. This simplification which affects 
only the extremal segments of the two strands does not change the algorithmic complexity of 
the recursion and the CPU time remains of order N 2 x I 2 . 

Acknowledgements. - We thank E. Yeramian for discussions and for introducing us to 
the Fixman-Freire scheme. 
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Fig. 5 - Effect of mutations: Specific heat for p — 0.005 (continuous curve), p = 0.01 (dotted curve 
with crosses) and p = 0.03 (long dashed curve), for a = 10~ 5 
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